IEEE/ACM Transactions on Computational Biology and Bioinformatics
● Institute of Electrical and Electronics Engineers (IEEE)
Preprints posted in the last 90 days, ranked by how well they match the content profile of IEEE/ACM Transactions on Computational Biology and Bioinformatics, based on 32 papers previously published here. The average preprint has a 0.02% match score for this journal, so anything above that is already an above-average fit.
Fletcher, W. L.; Sinha, S.
The practice of identifying biomarkers and developing prognostic models from genomic data has become increasingly prevalent. Such data often feature characteristics that make these tasks difficult, namely high dimensionality, correlations between predictors, and sparsity. Many modern methods have been developed to address these problematic characteristics while performing feature selection and prognostic modeling, but a large-scale comparison of their performance on diverse right-censored time-to-event data (also known as survival data) is much needed. We have compiled many existing methods, including some machine learning methods, several of which have performed well in previous benchmarks, primarily to compare their variable selection capability, and secondarily their survival time prediction, on many synthetic datasets with varying levels of sparsity, correlation between predictors, and signal strength of informative predictors. For illustration, we have also applied these methods in multiple analyses of a publicly available and widely used cancer cohort from The Cancer Genome Atlas. We evaluated the methods through extensive simulation studies in terms of the false discovery rate, F1-score, concordance index, Brier score, root mean square error, and computation time. Of the methods compared, CoxBoost and the adaptive LASSO performed well on all metrics, and the LASSO and elastic net excelled on concordance index and F1-score. The Benjamini-Hochberg and q-value procedures showed volatile performance in controlling the false discovery rate. Some methods' performance was greatly affected by differences in the data characteristics. With our extensive numerical study, we have identified the best-performing methods for a plethora of data characteristics using informative metrics. This will help cancer researchers choose the best approach for their needs when working with genomic data.
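As a reader's aid, here is a minimal, pure-Python sketch of Harrell's concordance index, one of the evaluation metrics used in this benchmark (pairs with tied event times are simply skipped for brevity; this is not the authors' code):

```python
from itertools import combinations

def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    times:  observed times (event or censoring)
    events: 1 if the event occurred, 0 if censored
    risks:  predicted risk scores (higher = earlier expected event)
    """
    concordant, comparable = 0.0, 0
    for i, j in combinations(range(len(times)), 2):
        # order the pair so i has the shorter observed time
        if times[i] > times[j]:
            i, j = j, i
        # the pair is comparable only if the earlier time is an actual event
        if times[i] == times[j] or not events[i]:
            continue
        comparable += 1
        if risks[i] > risks[j]:
            concordant += 1
        elif risks[i] == risks[j]:
            concordant += 0.5
    return concordant / comparable

# three events and one censored observation; perfectly ordered risks give 1.0
print(concordance_index([2, 5, 7, 9], [1, 1, 0, 1], [0.9, 0.6, 0.5, 0.1]))
```

A value of 0.5 corresponds to random ranking, which is why it pairs naturally with the Brier score and RMSE in the comparison above.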
Haque, N.; Mazed, A.; Ankhi, J. N.; Uddin, M. J.
Accurate classification of SARS-CoV-2 genomic variants is essential for effective genomic surveillance, yet it is challenged by extreme class imbalance, limited representation of rare variants, and distribution shifts in real-world sequencing data. In this study, we employed a hybrid RF-SVM framework designed for robust detection of rare SARS-CoV-2 variants. It integrates a random forest and a polynomial-kernel support vector machine to enhance sensitivity to minority classes while maintaining overall predictive stability. We systematically compared classical machine learning models, deep learning approaches, and hybrid strategies under both standard and distribution-shifted evaluation settings. Our results show that classical models using TF-IDF-based k-mer features outperform deep learning methods on macro-averaged performance metrics. The random forest classifier using TF-IDF features achieved the best overall performance, with a macro-averaged F1-score of 0.8894 and an accuracy of 96.3%. The model also demonstrated strong generalization ability, as evidenced by stable cross-validation performance (CV accuracy = 0.9637). The hybrid RF-SVM model further improves rare variant detection under severe class imbalance. Calibration analysis indicates reliable probability estimates for common variants, although challenges persist for minority classes. Overall, this study highlights the limitations of deep learning in highly imbalanced genomic settings and demonstrates that carefully designed hybrid machine learning approaches provide an effective and interpretable solution for rare SARS-CoV-2 variant detection.
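A minimal sketch of TF-IDF-weighted k-mer featurization, the sequence representation the classical models above rely on (toy sequences, not the study's pipeline):

```python
import math
from collections import Counter

def kmer_counts(seq, k=3):
    """Count overlapping k-mers in one sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def tfidf_vectors(seqs, k=3):
    """Turn sequences into TF-IDF vectors over the corpus k-mer vocabulary."""
    counts = [kmer_counts(s, k) for s in seqs]
    n = len(seqs)
    df = Counter()                      # document frequency of each k-mer
    for c in counts:
        df.update(c.keys())
    vocab = sorted(df)
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append([(c[w] / total) * math.log(n / df[w]) for w in vocab])
    return vocab, vectors

vocab, X = tfidf_vectors(["ATGCGA", "ATGAAA", "CCCGGG"], k=3)
```

K-mers shared by every sequence get an IDF of zero, so the vectors emphasize variant-discriminating subsequences, which is the point of the representation.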
Atabaigi Elmi, V.; Joeres, R.; Kalinina, O. V.
Enzymes are essential catalysts in many cellular processes. Understanding their interactions with small molecules, such as regulators, cofactors, and, most importantly, substrates, is crucial for understanding the biochemical processes that occur in cells. Correctly interpreting the roles of small molecules that interact with enzymes is key to elucidating enzyme function. Recently, the field of enzyme-small molecule interaction prediction has attracted growing interest from computational and, especially, deep-learning research, and numerous datasets and models with remarkable performance have been published. In this work, we critically examine one of the most popular datasets and three models trained on it, identifying leaked information that may overinflate reported model performance. We show that the inspected models are susceptible to information leakage, and their performance drops to near-random when the leakage is removed.
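A minimal sketch of the kind of leakage audit described above: flagging test pairs whose exact pair, enzyme, or molecule already appears in training (identifiers are hypothetical):

```python
def leaked_pairs(train_pairs, test_pairs):
    """Report test (enzyme, molecule) pairs with potential train overlap,
    a common source of inflated performance estimates."""
    train_set = set(train_pairs)
    train_enzymes = {e for e, _ in train_pairs}
    train_mols = {m for _, m in train_pairs}
    report = {"exact": [], "enzyme_seen": [], "molecule_seen": []}
    for pair in test_pairs:
        e, m = pair
        if pair in train_set:
            report["exact"].append(pair)        # identical pair leaked
        elif e in train_enzymes:
            report["enzyme_seen"].append(pair)  # enzyme-level overlap
        elif m in train_mols:
            report["molecule_seen"].append(pair)
    return report

r = leaked_pairs([("E1", "M1"), ("E2", "M2")],
                 [("E1", "M1"), ("E1", "M3"), ("E3", "M2"), ("E3", "M4")])
```

Identifier overlap is only the crudest check; sequence- or structure-similarity clustering catches the subtler leakage the paper targets.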
Mukherjee, S.; Srivastava, D.; Patra, N.
Protein-DNA complexes are involved in vital cellular functions such as gene regulation, replication, transcription, packaging, rearrangement, and damage repair. In this work, a streamlined geometric formalism for computing the absolute binding free energy was used to obtain chemically accurate in silico estimates of the binding free energy of three protein-DNA complexes. The molecular interactions between protein and DNA involved hydrogen bonds and electrostatic, van der Waals, and hydrophobic interactions. Using this formalism, researchers can obtain the absolute binding free energy of a protein-DNA complex with remarkable accuracy and modest computational cost.
Guler, F.; Goksuluk, D.; Xu, M.; Choudhary, G.; Agraz, M.
Applying deep learning models to RNA-Seq data poses substantial challenges, primarily due to the high dimensionality of the data and the limited sample sizes. To address these issues, this study introduces an advanced deep learning pipeline that integrates feature engineering with data augmentation. The application focuses on biomedical engineering, specifically the classification of RNA-Seq datasets for disease diagnosis. The proposed framework was initially validated on synthetic datasets generated from Naive Bayes, where MLP-based augmentation yielded a notable improvement in predictive performance. Building on this foundation, we applied the approach to chromophobe renal cell carcinoma (KICH) RNA-Seq data from The Cancer Genome Atlas (TCGA). Following standard preprocessing steps (normalization, transformation, and dimensionality reduction), the analysis concentrated on three main aspects: augmentation strategies, preprocessing methods, and explainable AI (XAI) techniques in relation to classification outcomes. Feature selection was performed through PCA, Boruta, and RF-based methods. Three augmentation strategies (linear interpolation, SMOTE, and MixUp) were evaluated. To maintain methodological rigor, augmentation was applied exclusively to the training set, while the test set was held out for unbiased evaluation. Within this framework, we conducted a comparative assessment of multiple deep learning architectures, including MLP, GNN, and the recently proposed Kolmogorov-Arnold networks (KAN). The GNN achieved the highest classification accuracy (99.47%) when trained with MixUp augmentation combined with RF feature selection, and achieved the best F1 score (0.9948). Consequently, the GNN-based XAI framework was applied to the RF dataset enriched with MixUp.
XAI analyses identified the top 20 most influential genes, such as HNF4A, DACH2, MAPK15, and NAT2, which played the greatest role in classification, thereby confirming the biological plausibility of the model outputs. To further validate model robustness, cervical cancer and Alzheimer's RNA-Seq datasets were also tested, yielding consistent and reliable results. Overall, the findings highlight the value of incorporating data augmentation into deep learning models for RNA-Seq analysis, not only to improve predictive performance but also to enhance biological interpretability through explainable AI approaches.
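A minimal sketch of MixUp applied to a training split only, as the pipeline above prescribes (pure Python, toy data; not the study's implementation):

```python
import random

def mixup(features, labels, alpha=0.2, n_new=100, seed=0):
    """Generate synthetic training rows as convex combinations of random
    pairs; applied to the training split only, never to the held-out test set."""
    rng = random.Random(seed)
    new_x, new_y = [], []
    for _ in range(n_new):
        i = rng.randrange(len(features))
        j = rng.randrange(len(features))
        lam = rng.betavariate(alpha, alpha)   # mixing coefficient ~ Beta(a, a)
        new_x.append([lam * a + (1 - lam) * b
                      for a, b in zip(features[i], features[j])])
        new_y.append(lam * labels[i] + (1 - lam) * labels[j])
    return new_x, new_y

x_aug, y_aug = mixup([[0.0, 1.0], [1.0, 0.0]], [0.0, 1.0], n_new=5)
```

Because labels are mixed too, the augmented targets are soft; models are trained against them with a cross-entropy that accepts fractional labels.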
Kusumoto, T.
In this study, we present an XAI-based genetic profiling framework that quantifies gene importance for distinguishing cancer cells from normal cells based on an interpretable AI decision process. We propose a new explainable AI (XAI) classification model that combines probabilistic circuits with the Nucleotide Transformer. By leveraging the strong feature-extraction capability of the Nucleotide Transformer, we design a tractable classification framework based on probabilistic circuits while preserving probabilistic interpretability. To demonstrate the capability of this framework, we used the GSE131907 single-cell lung cancer atlas and constructed a dataset consisting of cancer-cell and normal-cell classes. From each sample, 900 gene types were randomly selected and converted into embedding vectors using the Nucleotide Transformer, after which the classification model was trained. We then extracted class-specific probabilistic contributions from the tractable model and defined a contribution score for the cancer-cell class. Genetic profiling was performed based on these scores, providing insights into which genes and biological pathways are most important for the classification task. Notably, 1,524 of the 9,540 observed genes showed contribution scores that contradicted what would be expected from their class-wise occurrence frequencies, suggesting that the profiling goes beyond simple statistics by leveraging biological feature representations encoded by the Nucleotide Transformer. The top-ranked genes among these contradictory cases include several well-studied genes in cancer research (e.g., ITGA5, SIGLEC9, NOTUM, and TP73). Overall, these analyses go beyond traditional statistical or gene-expression-level approaches and provide new academic insights for genetic research.
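The notion of a "contradictory" gene above can be made concrete. A minimal sketch under an assumed score convention (positive contribution favours the cancer class); the gene names and numbers are illustrative, not the study's values:

```python
def contradictory_genes(contrib, freq_cancer, freq_normal):
    """Genes whose model contribution score points the opposite way
    from their raw class-wise occurrence frequencies."""
    flagged = []
    for gene, score in contrib.items():
        freq_leans_cancer = freq_cancer[gene] > freq_normal[gene]
        model_leans_cancer = score > 0   # assumed sign convention
        if freq_leans_cancer != model_leans_cancer:
            flagged.append(gene)
    return flagged

contrib = {"ITGA5": 0.4, "TP73": -0.2, "GAPDH": 0.1}     # made-up model scores
freq_cancer = {"ITGA5": 0.1, "TP73": 0.6, "GAPDH": 0.5}  # occurrence in cancer cells
freq_normal = {"ITGA5": 0.3, "TP73": 0.2, "GAPDH": 0.4}  # occurrence in normal cells
flagged = contradictory_genes(contrib, freq_cancer, freq_normal)
```

Such disagreements are exactly where the learned representation adds information beyond occurrence statistics.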
Amin, R.; Rana, M. M. H.; Aktar, S.
Federated learning (FL) enables collaborative clinical model training without centralized data sharing, yet its deployment is hindered by statistical heterogeneity (non-IID data) and inherent class imbalance across healthcare institutions. Conventional aggregation strategies such as FedAvg and FedProx weight client updates solely by dataset size, ignoring class distributions and thereby biasing the global model toward the majority class. To address this, we propose Distribution-Aware Federated Learning (DA-FL), which introduces a minority-class amplification factor φ_k, computed as the ratio of a client's local positive-class rate to the global positive-class rate. Combined with class-weighted cross-entropy loss at the client level, DA-FL forms a two-level correction mechanism that mitigates imbalance without additional data sharing. Experiments on the CDC BRFSS 2021 diabetes dataset (236,378 records across five simulated clients under three non-IID levels) show that DA-FL improves F1-Macro by 18.2% and G-Mean by 26.7% over FedAvg under moderate non-IID conditions, while achieving 31-fold greater F1-Macro stability across 30 communication rounds. These findings demonstrate that DA-FL is an effective and practically deployable solution for federated clinical prediction under realistic non-IID and class-imbalanced settings.
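The amplification factor φ_k lends itself to a short sketch. Only the ratio definition comes from the abstract; scaling the size-based weights by φ_k and renormalising is an assumption for illustration:

```python
def dafl_weights(client_sizes, client_pos_rates):
    """Distribution-aware aggregation weights: scale each client's
    size-based weight by phi_k = local positive rate / global positive rate,
    then normalise so the weights sum to one (normalisation assumed)."""
    total = sum(client_sizes)
    # global positive rate = pooled positives / pooled samples
    global_pos = sum(n * p for n, p in zip(client_sizes, client_pos_rates)) / total
    raw = [n * (p / global_pos) for n, p in zip(client_sizes, client_pos_rates)]
    s = sum(raw)
    return [w / s for w in raw]

# two equally sized clients; the minority-rich client gets amplified
w = dafl_weights([1000, 1000], [0.05, 0.45])
```

With FedAvg both clients would get weight 0.5; here the client holding most of the positive (minority) class dominates the update, which is the intended correction.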
Muneeb, M.; Ascher, D.
Identifying disease-associated genes enables the development of precision medicine and the understanding of biological processes. Genome-wide association studies (GWAS), gene expression data, biological pathway analysis, and protein network analysis are among the techniques used to identify causal genes. We propose a machine-learning (ML) and deep-learning (DL) pipeline to identify genes associated with a phenotype. The proposed pipeline consists of two interrelated processes. The first classifies people into cases/controls based on genotype data. The second calculates feature importance to identify genes associated with a particular phenotype. We considered 30 phenotypes from the openSNP data for analysis, 21 ML algorithms, and 80 DL algorithms and variants. The best-performing ML and DL models, evaluated by the area under the curve (AUC), F1 score, and Matthews correlation coefficient (MCC), were used to identify important single-nucleotide polymorphisms (SNPs), and the identified SNPs were compared with the phenotype-associated SNPs from the GWAS Catalog. The mean per-phenotype gene identification ratio (GIR) was 0.84. These results suggest that SNPs selected by ML/DL algorithms to maximize classification performance can help prioritise phenotype-associated SNPs and genes, potentially supporting downstream studies aimed at understanding disease mechanisms and identifying candidate therapeutic targets.
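The gene identification ratio reduces to a set overlap. The exact definition used here is an assumption (fraction of pipeline-selected genes confirmed in the GWAS Catalog), and the gene names are placeholders:

```python
def gene_identification_ratio(model_genes, catalog_genes):
    """Assumed definition for illustration: the fraction of genes selected
    by the ML/DL pipeline that also appear among the phenotype-associated
    genes in the GWAS Catalog."""
    selected = set(model_genes)
    return len(selected & set(catalog_genes)) / len(selected)

gir = gene_identification_ratio(["A", "B", "C", "D"], ["A", "B", "C", "X", "Y"])
```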
Koeksal, R.; Fritz, A.; Kumar, A.; Schmidts, M.; Tran, V. D.; Backofen, R.
Identifying genes associated with human diseases is essential for effective diagnosis and treatment. Experimentally identifying disease-causing genes is time-consuming and expensive. Computational prioritization methods aim to streamline this process by ranking genes based on their likelihood of association with a given disease. However, existing methods often report long ranked lists of thousands of potential disease genes, often containing many false positives. This fails to meet the practical needs of clinicians, who require shorter, more precise candidate lists. To address this problem, we introduce DisGeneFormer (DGF), an end-to-end disease-gene prioritization pipeline. Our approach is based on two distinct graph representations, modeling gene and disease relationships, respectively. Each graph is first processed separately by graph attention and then jointly by a transformer module that combines within-graph and cross-graph knowledge through local and global attention. We propose an evaluation pipeline based on the precision of a top-K ranked gene list, with K set to clinically feasible values between 5 and 50, relying solely on experimentally verified associations as ground truth. Our evaluation demonstrates that DGF substantially outperforms existing methods. We additionally assessed the influence of the negative-data sampling strategy and analysed the effect of graph topology and features on the performance of our model.
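The top-K precision evaluation described above reduces to a few lines (gene names are placeholders):

```python
def precision_at_k(ranked_genes, true_genes, k):
    """Fraction of the top-k ranked genes that are experimentally
    verified disease genes."""
    top = ranked_genes[:k]
    return sum(g in true_genes for g in top) / k

ranking = ["BRCA1", "TP53", "GENE_X", "EGFR", "GENE_Y"]
truth = {"BRCA1", "EGFR", "TP53"}
print(precision_at_k(ranking, truth, 5))  # 3 of the top 5 are verified -> 0.6
```

Restricting K to 5-50 is what aligns the metric with a clinician's short candidate list rather than a genome-wide ranking.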
Mukherjee, P.; Mandal, S.
This paper describes MMP, a three-stage framework for systematic quantum optimization of constrained molecular docking problems. The protocol addresses the "formulation bottleneck"--the critical challenge of translating constrained optimization problems into valid QUBO (Quadratic Unconstrained Binary Optimization) formulations for quantum solvers. MMP replaces heuristic penalty tuning with data-driven calibration through: (1) classical solution-space analysis to validate fragment libraries before quantum deployment, (2) systematic penalty sweeps to identify optimal "Goldilocks Zone" coefficients, and (3) MAC-QAOA (MMP Adaptive Constraint QAOA) with layer-dependent penalty decay. Preliminary benchmarks on synthetic constrained optimization problems demonstrate 99.7% solution validity at identified elbow points and a 25.5% improvement in solution quality over static-penalty QAOA. MMP is hardware-agnostic but designed for near-term devices, including Pasqal's Orion Gamma (140+ qubits). The theoretical framework, algorithmic details, and preliminary validation results of the protocol are discussed, establishing a systematic methodology for quantum-augmented optimization workflows for drug discovery. All benchmarks are conducted on synthetic constrained optimization instances that reproduce structural features of docking formulations; application to real molecular docking targets is left for future work.
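A minimal sketch of the penalty-formulation step: encoding a "pick exactly one option" constraint into a QUBO and brute-forcing a small instance to see when the penalty is too weak, in the spirit of the validity sweeps above (toy costs; not MMP itself):

```python
from itertools import product

def qubo_energy(Q, x):
    """Energy x^T Q x for a binary vector x (upper-triangular Q)."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

def penalised_qubo(costs, penalty):
    """Encode 'pick exactly one of n options' as a QUBO:
    minimise sum_i c_i x_i + penalty * (sum_i x_i - 1)^2.
    Expanding the square (x_i^2 = x_i for binaries) gives diagonal terms
    c_i - penalty and off-diagonal terms 2*penalty (constant term dropped)."""
    n = len(costs)
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        Q[i][i] = costs[i] - penalty
        for j in range(i + 1, n):
            Q[i][j] = 2 * penalty
    return Q

def brute_force_min(Q, n):
    """Exhaustively find the lowest-energy bitstring (fine for small n)."""
    return min(product([0, 1], repeat=n), key=lambda x: qubo_energy(Q, x))

costs = [3.0, 1.0, 2.0]
```

With a sufficiently large penalty the minimiser is the valid one-hot solution picking the cheapest option; with a tiny penalty the invalid all-zero string wins, which is exactly the failure mode a penalty sweep is meant to expose.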
Guan, J. S.; Wang, Z.; Mu, Y.
Protein-protein binding affinity is important for understanding protein interactions within a protein complex and for identifying strong drug-peptide binders to a target protein. Many structure-based models with reasonable performance have been built previously. However, such models require a protein complex structure as input, which is usually unavailable due to high cost and experimental constraints. To tackle this issue, the sequence-based CrossAffinity model was constructed in this study, using a cross-attention module to extract contextual information about interacting protein components while separating the protein complex into two distinct parts to predict protein-protein binding affinity. Despite being trained on an older dataset, CrossAffinity outperformed all structure-based and sequence-based models on the S34 test set, which contains newer protein complex structures and binding affinity values, demonstrating generalisability to new data points. On the other test sets, namely S90, the S90 subset, and S79*, CrossAffinity also outperformed all other sequence-based models while maintaining performance comparable to many recently published structure-based models. The acceptable performance and quick inference of CrossAffinity enable it to be deployed in situations requiring the prediction of binding affinity for many protein complexes that lack structural information.
Kumar, A.; Do, T. A.; Gruening, B.; Becker, H.; Backofen, R.
DNA methylation is a significant epigenetic modification involving the addition of a methyl group at the 5-position of cytosine residues. The modification is implicated in disease progression, immune response, and outcomes in diseases such as breast cancer (BC) and acute myeloid leukemia (AML). Illumina's HumanMethylation450 BeadChip (450K) and EPIC BeadChip (850K) methylation arrays are heavily used in such cancer studies to determine differentially expressed and differentially methylated genomic regions. Many of these are biomarkers used effectively for exploring therapeutic targets. Several studies report a few potential biomarkers, but the enormous number of largely unexplored probe-level (CpG site) methylation signals may contain additional significant biomarkers. To prioritise the under-explored and disease-specific CpG sites from DNA methylation arrays and potentially uncover novel biomarkers, we present GraphMeX-plain, a novel graph neural network (GNN)-based approach with an explainable AI module. The underlying graph neural network is a principal neighbourhood aggregation (PNA) network. The approach uses the biomarkers reported in recent studies to rank biomarkers from the unexplored set. A similarity graph between CpG sites (known and unexplored sets) is constructed using DNA methylation β values from arrays, producing an interaction graph. Biomarkers from recent studies are used as seeds, and from the unexplored CpG sites, highly variable ones (excluding the seeds) that vary significantly between conditions (BC patients and normal controls for breast cancer arrays) are selected.
Using the combination of seed and highly variable CpG sites, a positive-unlabeled approach, network-informed adaptive positive-unlabeled learning (NIAPU), is used to assign soft labels to unknown CpG sites (likely positive, weakly negative, likely negative, and reliable negative, in descending order of the likelihood of being a potential biomarker). The graph neural network, a multi-layer PNA, refines the soft label assignments and achieves a high weighted F1 classification score of 0.93 for BC and 0.91 for AML. The most likely set of CpG sites, classified as "likely positive", is further explored using GNNExplainer, an explainable AI approach. Subgraphs for likely positive CpG sites predicted with high probabilities are computed, and their proximities to the original seed CpG sites are analysed. The CpG sites predicted as likely positive interact closely with the seeds. The top likely positive CpG site for BC is cg13265740 (C6orf115), whose gene C6orf115 is strongly associated with BC. For AML, the top predicted likely positive CpG site is cg23281527 (KLHDC7A), whose gene KLHDC7A plays a strong role in the mechanism of AML. A high percentage of these likely positive CpG sites for both BC and AML, which remained unseen by the GNN model during training, are highly relevant to these diseases and can serve as potential therapeutic targets with prognostic value.
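The proximity analysis between predicted sites and seed biomarkers can be sketched as a multi-source BFS over the similarity graph (toy graph, hypothetical CpG identifiers):

```python
from collections import deque

def distance_to_seeds(graph, seeds):
    """Multi-source BFS over a CpG similarity graph: hop distance from the
    nearest seed biomarker, a crude proxy for the proximity analysis above.
    Nodes unreachable from any seed are absent from the result."""
    dist = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        for nb in graph.get(node, ()):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

g = {"seed1": ["cgA", "cgB"], "cgA": ["seed1", "cgC"],
     "cgB": ["seed1"], "cgC": ["cgA"], "cgD": []}
d = distance_to_seeds(g, ["seed1"])
```

Sites a short hop from the seeds are the natural "likely positive" candidates; isolated sites like cgD never receive a distance at all.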
Zhou, M.; Zhang, M.; Wang, J.; Shao, C.; Yan, G.
Cardiovascular disease is one of the leading causes of death worldwide, with myocardial infarction (MI) being a major cause of both morbidity and mortality among cardiovascular patients. MI patients face a higher risk of cardiovascular disease recurrence afterwards. Therefore, accurately predicting the risk of recurrence and identifying key risk factors are crucial for clinical decision-making. In this paper, we consider the interrelationships among cardiovascular factors from a systemic perspective. We first construct a differential network for each patient to capture individual-specific deviations in factor relationships, and we propose a novel method, termed Causal Factor-aware Graph Neural Network (CFGNN), which integrates factor interactions to predict the recurrence risk of MI patients while uncovering key risk factors from a causal perspective. Experimental results demonstrate that CFGNN performs well on real-world hospital-derived datasets, effectively identifying several key risk factors. This method not only deepens our understanding of cardiovascular disease but also paves the way for more targeted and effective interventions.
Misra, S.; Roy, S.; Ray, S. S.
Genes with similar expression profiles often exhibit similar functional properties. An "integrated similarity score" (ISS) is developed by combining different expression similarity measures through weights, obtained using biological information, to improve gene similarity. The expression similarity measures are converted to the common framework of positive predictive value using functional annotation. A fitness function, called "fitness function using functional annotation of genes" (FFFAG), is also developed by minimizing the difference between the functional similarity value and the ISS. The FFFAG is used to determine the weight combination of the different similarity measures in the ISS. In addition, an existing similarity measure, called TMJ (integrated similarity measure obtained by multiplying Triangle and Jaccard similarity), is modified to incorporate biological knowledge involving functional annotation. The results demonstrate that the ISS is superior to individual similarity measures at finding similar gene pairs. Further, the ISS predicts the functional categories of 40 unclassified yeast genes at a p-value cutoff of 10^-10 from 12 clusters. The associated code is accessible at http://www.isical.ac.in/~shubhra/ISS.html.
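A minimal sketch of the weighted-combination idea behind the ISS; the measure names and weights here are illustrative, not the paper's fitted values:

```python
def integrated_similarity(pair_scores, weights):
    """Integrated similarity score for one gene pair: a convex combination
    of several expression-similarity measures. In the paper the weights are
    fitted via the FFFAG fitness function; here they are just placeholders."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9  # convex combination
    return sum(weights[m] * s for m, s in pair_scores.items())

scores = {"pearson": 0.8, "spearman": 0.7, "euclidean_sim": 0.5}
weights = {"pearson": 0.5, "spearman": 0.3, "euclidean_sim": 0.2}
iss = integrated_similarity(scores, weights)
```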
Muneeb, M.; Ascher, D.; Myung, Y.; Feng, S.; Henschel, A.
Genotype-phenotype prediction plays a crucial role in identifying disease-causing single nucleotide polymorphisms and in precision medicine. In this manuscript, we benchmark the performance of various machine/deep learning algorithms and polygenic risk score tools on 80 binary phenotypes extracted from the openSNP dataset. After cleaning and extraction, the genotype data for each phenotype are passed to PLINK for quality control, after which they are transformed separately for each of the considered tools/algorithms. To compute polygenic risk scores, we used the quality-controlled test data and the genome-wide association study summary statistics file, along with various combinations of clumping and pruning. For the machine learning algorithms, we used p-value thresholding on the training data to select the single nucleotide polymorphisms, and the resulting data were passed to the algorithm. Our results report the average 5-fold Area Under the Curve (AUC) for 29 machine learning algorithms, 80 deep learning algorithms, and 3 polygenic risk score tools with 675 different clumping and pruning parameter combinations. Machine learning outperformed for 44 phenotypes, while polygenic risk score tools excelled for 36 phenotypes. The results give valuable insights into which techniques tend to perform better for certain phenotypes compared with more traditional polygenic risk score tools.
Uppaluri, K. R.; Challa, H. J.; Vempati, K. K.; Kadali, L. N.; Palasamudram, K.; Rayala, M.
Coronary artery disease (CAD) is a multifactorial condition influenced by genetic, phenotypic, and environmental factors. Traditional risk prediction models fall short in capturing the polygenic complexity of CAD, particularly in underrepresented populations. This study presents SIGMA (Scoring Importance of Genes specific to disease using Machine learning Algorithms), a novel AI-powered framework that enhances CAD risk prediction by integrating genomic and phenotypic data. Our approach leverages GEMS (GeneConnectRx Evidence Metrics), an LLM-driven system to score 1772 CAD-associated genes, and CASCADE (Comprehensive Assessment of Sequence and Clinical Annotation Data Evaluation), a tiered variant scoring pipeline. Using whole exome sequencing (WES) data from 1,243 individuals (628 controls, 615 CAD cases), the model integrates age and gender as key non-modifiable phenotypes. Results show significant improvements in sensitivity (from 0.41 to 0.79), specificity (0.70 to 0.72), and AUC (0.59 to 0.81) when phenotype data are incorporated. Our findings highlight the potential of AI-integrated genomics for population-specific CAD risk stratification.
Duarte, S. A.; Mehdiabadi, M.; Bugnon, L. A.; Aspromonte, M. C.; Piovesan, D.; Milone, D. H.; Tosatto, S.; Stegmayer, G.
Intrinsically disordered proteins (IDPs) play an important role in a wide range of biological functions and are linked to several diseases. Due to technical difficulties and the high cost of experimental determination of disorder in proteins, combined with the exponential increase of unannotated protein sequences, the development of computational methods for disorder prediction has become an active area of research in the last few decades. In this work, we present emb2dis, a deep learning model that uses protein language models (pLMs) to predict disorder from sequence. The emb2dis tool is a pre-trained model that receives a protein sequence as input, calculates its pLM embedding, and passes it to a deep learning model. In contrast to existing approaches, emb2dis integrates informative sequence representations with a novel architecture that combines residual networks (ResNets) and dilated convolutions. This design effectively enlarges the receptive field of the convolution operation, enabling the model to better capture an extended context of each amino acid. At the output, emb2dis assigns a disorder propensity score to each residue in the sequence. The model was evaluated on datasets from the latest CAID3 blind benchmark for disorder prediction, where it achieved first place in the Disorder-PDB category, exhibiting strong performance with high AUC and Fmax scores. Additionally, it ranked among the top ten methods on the Disorder-NOX dataset. We provide a freely available web demo for emb2dis (https://sinc.unl.edu.ar/web-demo/emb2dis/) and a source code repository for local installation. The importance of the emb2dis tool is that it provides a new deep learning approach and significant improvements in the prediction of protein disorder, with a simple web interface and graphical output detailing per-residue disorder.
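The receptive-field claim about dilated convolutions is easy to verify with arithmetic: for stride-1 layers, each layer with kernel size k and dilation d adds (k - 1) * d positions.

```python
def receptive_field(layers):
    """Receptive field of a stack of (kernel_size, dilation) 1-D conv
    layers with stride 1: each layer adds (k - 1) * d positions."""
    rf = 1
    for k, d in layers:
        rf += (k - 1) * d
    return rf

# four plain 3-wide convs vs. the same depth with doubling dilations
plain = [(3, 1)] * 4
dilated = [(3, 1), (3, 2), (3, 4), (3, 8)]
print(receptive_field(plain), receptive_field(dilated))  # 9 vs. 31
```

Doubling dilations grow the context window exponentially with depth at no extra parameter cost, which is why the architecture can see far-away residues.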
Kumar, S.; Zambreno, J.; Khokhar, A.; Akram, S.; Saeed, F.
Improving the speed and efficiency of database search algorithms that deduce peptides from mass spectrometry (MS) data has been an active area of research for more than three decades. The need for faster database search methods has rapidly increased due to the growing interest in studying non-model organisms, meta-proteomics, and proteogenomic data, which are notorious for their enormous search spaces. The poor scalability of serial algorithms with growing database size and increasing post-translational modification parameters is a widely recognized problem. While high-performance computing techniques can be used on supercomputing machines, the need for real-time, on-the-instrument solutions necessitates the development of an efficient system-on-chip that optimizes design constraints such as cost, performance, and power. To showcase that such a system can work, we present an FPGA-based computational framework called FiCOPS to accelerate database search using a hardware/software co-design methodology. First, we theoretically analyze the database-search algorithm (closed search) to reveal opportunities for parallelism and uncover computational bottlenecks. We then design an FPGA-based architectural template to exploit the parallelism inherent in the search workload. We also formulate an analytical performance model for the architectural template to perform rapid design space exploration and find a near-optimal accelerator configuration. Finally, we implement our design on the Intel Stratix 10 FPGA platform and evaluate it using real-world datasets. Our experiments demonstrate that FiCOPS achieves a 3.5x speed-up over existing CPU solutions and 3x and 5x reductions in power consumption compared to existing CPU and GPU solutions, respectively.
Surkanti, S. R.; Kasturi, V. V.; Saligram, S. S.; Basangari, B. C.; Kondaparthi, V.
RNA interference (RNAi) is a crucial biological post-transcriptional gene silencing mechanism in which small interfering RNA (siRNA) guides the RNA-induced silencing complex (RISC) to bind messenger RNA (mRNA), thereby silencing it and stopping protein formation. We exploit this process to prevent the formation of harmful proteins by silencing mRNA before it is translated into protein through an effective siRNA. There is a need for a computational model that predicts the effectiveness of an siRNA on a given mRNA. Designing such a model is challenging, as the available data are either scarce or biased, and existing models lack generalization ability even though the ratio of parameters to training samples is very high. To overcome these challenges, we introduce RNAiSpline, which incorporates self-supervised pretraining and fine-tuning with a Kolmogorov-Arnold Network (KAN), a Convolutional Neural Network (CNN), and a Transformer encoder. Evaluation on the independent test dataset yields an ROC-AUC of 0.8175, an F1 score of 0.7717, and a Pearson correlation of 0.6032, making RNAiSpline a robust model for siRNA efficacy prediction.
Zhang, X.; Fang, Z.; Tang, K.; Chen, H.; Li, J.
Targeted drug therapies offer a promising approach for treating complex diseases, with combinational drug therapies often employed to enhance therapeutic efficacy. However, unintended drug-drug interactions (DDIs) may undermine treatment outcomes or cause adverse side effects. In this work, we propose a novel joint learning framework for the simultaneous prediction of effective drug combinations and drug-drug interactions, based on coupled tensor-tensor factorization. Specifically, we model drug combination therapies and DDIs by representing drug-drug-disease associations and drug-drug interaction profiles as coupled three-way tensors. To address the challenges of data incompleteness and sparsity, the proposed model integrates auxiliary drug similarity information, such as chemical structure similarities, drug-specific side effects, drug target profiles, and drug inhibition data on cancer cell lines, within a multi-view learning framework. For optimization, we adopt a modified Alternating Direction Method of Multipliers (ADMM) algorithm that ensures convergence while enforcing non-negativity constraints. In addition to standard tensor completion tasks, we further evaluate the proposed method under a more realistic new-drug prediction setting, where all interactions involving a previously unseen drug are withheld. This scenario closely aligns with real-world applications, in which reliable predictions for emerging or under-studied compounds are essential. We evaluate the proposed method, SI-ADMM, on a comprehensive dataset compiled from multiple sources, including DrugBank, CDCDB, SIDER, and PubChem. Our experiments show that SI-ADMM maintains robust performance and achieves the best results compared with other tensor factorization approaches, with or without auxiliary information, particularly in the new-drug prediction setting. The implementation of our method is publicly available at: https://github.com/Xiaoge-Zhang/SI-ADMM.